LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning
Authors
Abstract
Gradient-based distributed learning in parameter server (PS) computing architectures is subject to random delays due to straggling worker nodes and to possible communication bottlenecks between the PS and the workers. Solutions have recently been proposed to separately address these impairments based on the ideas of gradient coding (GC), worker grouping, and adaptive worker selection. This article provides a unified analysis of these techniques in terms of wall-clock time, communication, and computation complexity measures. Furthermore, in order to combine the benefits of GC and grouping in terms of robustness to stragglers with the communication and computation load gains of adaptive selection, novel strategies, named lazily aggregated gradient coding (LAGC) and grouped-LAG (G-LAG), are introduced. Analysis and results show that G-LAG provides the best wall-clock time and communication performance while maintaining a low computational cost, for two representative distributions of the computing times of the worker nodes.
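To make the lazy-aggregation idea concrete, the following is a minimal, illustrative Python sketch of a PS round in which a worker's gradient is uploaded only when it has changed enough since its last transmission, in the spirit of LAG-style adaptive selection. The threshold rule, function names, and constants are simplifying assumptions for illustration, not the exact LAGC/G-LAG conditions derived in the article.

```python
import numpy as np

# Illustrative LAG-style round: each worker uploads its gradient only when it
# differs enough from the stale copy stored at the parameter server (PS);
# otherwise the PS reuses the stale copy. The skip threshold "xi" is a
# hypothetical constant, not the condition derived in the article.

def worker_gradient(theta, X, y):
    """Least-squares gradient on a worker's local data shard (toy model)."""
    return X.T @ (X @ theta - y) / len(y)

def lag_round(theta, shards, stale_grads, xi=1e-3):
    """One PS round with lazy (skipped) uploads.
    Returns the aggregated gradient and the number of workers that communicated."""
    agg, uploads = np.zeros_like(theta), 0
    for m, (X, y) in enumerate(shards):
        g = worker_gradient(theta, X, y)
        # Upload only if the new gradient moved enough relative to the stale one.
        if np.linalg.norm(g - stale_grads[m]) ** 2 > xi * np.linalg.norm(g) ** 2:
            stale_grads[m] = g
            uploads += 1
        agg += stale_grads[m]  # the PS always uses its stored (possibly stale) copy
    return agg / len(shards), uploads

# Toy usage: 4 workers, each holding a shard of a synthetic regression problem.
rng = np.random.default_rng(0)
d, n = 5, 40
X_all, theta_true = rng.normal(size=(n, d)), rng.normal(size=d)
y_all = X_all @ theta_true
shards = [(X_all[i::4], y_all[i::4]) for i in range(4)]
theta = np.zeros(d)
stale = [np.zeros(d) for _ in range(4)]
for _ in range(50):
    g, used = lag_round(theta, shards, stale)
    theta -= 0.3 * g
```

In this sketch, skipped uploads save communication at the cost of aggregating slightly stale gradients, which is the tradeoff the article quantifies.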
Related resources
Communication-Computation Efficient Gradient Coding
This paper develops coding techniques to reduce the running time of distributed learning tasks. It characterizes the fundamental tradeoff to compute gradients (and more generally vector summations) in terms of three parameters: computation load, straggler tolerance and communication cost. It further gives an explicit coding scheme that achieves the optimal tradeoff based on recursive polynomial...
Gradient Sparsification for Communication-Efficient Distributed Optimization
Modern large scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computational architectures. A key bottleneck is the communication overhead for exchanging information such as stochastic gradients among different workers. In this paper, to reduce the communication cost we propose a convex optimization formulation to minimize the coding...
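As a simple point of reference for how sparsifying gradients cuts upload cost, below is a minimal sketch of magnitude-based (top-k) sparsification. This is a simpler heuristic than the convex-optimization formulation proposed in that paper, and the function names and parameters are illustrative assumptions only.

```python
import numpy as np

def sparsify_topk(grad, k):
    """Keep only the k largest-magnitude entries of the gradient.
    Returns (indices, values), i.e. the compressed message a worker would send."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def densify(idx, vals, dim):
    """Reconstruct a zero-filled dense gradient at the parameter server."""
    g = np.zeros(dim)
    g[idx] = vals
    return g

# Toy usage: transmit only 10 of 1000 coordinates.
rng = np.random.default_rng(1)
g = rng.normal(size=1000)
idx, vals = sparsify_topk(g, k=10)
g_hat = densify(idx, vals, dim=1000)
```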
Near-Optimal Straggler Mitigation for Distributed Gradient Methods
Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes where distributed worker nodes compute partial gradients based on their partial and local data sets, and send the results to a master node where all the computations are aggregated into a f...
Gradient Coding: Avoiding Stragglers in Distributed Learning
We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent. We implement our schemes in python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and ge...
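The replicate-and-code idea can be illustrated with the well-known three-worker instance of this type of scheme: each worker holds two of three data blocks and sends a single coded combination, and the full gradient sum is recoverable from any two workers, so one straggler can be ignored. The numpy check below is an illustrative verification of that small example, not the paper's general construction.

```python
import numpy as np

# 3-worker gradient coding example: each worker stores 2 of 3 data blocks and
# uploads one coded combination of the corresponding block gradients.
#   worker 0 sends 0.5*g1 + g2
#   worker 1 sends       g2 - g3
#   worker 2 sends 0.5*g1 + g3
# The PS recovers g1 + g2 + g3 from ANY two of the three messages.

rng = np.random.default_rng(2)
g1, g2, g3 = rng.normal(size=(3, 4))          # block gradients (toy vectors)
msgs = [0.5 * g1 + g2, g2 - g3, 0.5 * g1 + g3]

# Decoding coefficients for each pair of surviving (non-straggling) workers.
decode = {(0, 1): (2.0, -1.0), (0, 2): (1.0, 1.0), (1, 2): (1.0, 2.0)}

full = g1 + g2 + g3
for (i, j), (a, b) in decode.items():
    assert np.allclose(a * msgs[i] + b * msgs[j], full)
```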
Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning
Performance of distributed optimization and learning systems is bottlenecked by “straggler” nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is “encoded” to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically left out of the computation at...
Journal
Journal title: IEEE Transactions on Neural Networks and Learning Systems
Year: 2021
ISSN: 2162-237X, 2162-2388
DOI: https://doi.org/10.1109/tnnls.2020.2979762